Multimodal Translation System Using Texture-Mapped Lip-Sync Images for Video Mail and Automatic Dubbing Applications
Authors
Abstract
We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion by synchronizing it with the translated speech. The system introduces both a face synthesis technique that can generate the lip shape of any viseme and a face tracking technique that can estimate the original position and rotation of the speaker's face in an image sequence. To retain the speaker's facial expression, we substitute only the image of the speech organs with a synthesized one, generated from a 3D wire-frame model that is adaptable to any speaker. This approach provides translated image synthesis with an extremely small database. Tracking of the face motion in a video image is performed by template matching: the translation and rotation of the face are detected using a 3D personal face model whose texture is captured from a video frame. We also propose a method to customize the personal face model with our GUI tool. By combining these techniques with translated voice synthesis, automatic multimodal translation suitable for video mail or automatic dubbing into other languages can be achieved.
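The abstract states that face motion is tracked by template matching against a textured 3D personal face model. As an illustrative sketch of the 2D core of that idea (not the authors' actual implementation, which also estimates 3D rotation), the following NumPy code finds the offset of a face template in a frame by exhaustive normalized cross-correlation; the synthetic frame and offsets are assumptions for demonstration only.

```python
import numpy as np

def match_template(frame, template):
    """Return the (row, col) offset in `frame` where `template` best
    matches, scored by normalized cross-correlation. Illustrative
    stand-in for the paper's template-matching face tracker."""
    fh, fw = frame.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t * t).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(fh - th + 1):
        for c in range(fw - tw + 1):
            patch = frame[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum()) * t_norm
            score = (p * t).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos

# Synthetic check: embed a known patch and recover its position.
rng = np.random.default_rng(0)
frame = rng.random((40, 40))
template = frame[12:20, 25:33].copy()
print(match_template(frame, template))  # (12, 25)
```

A production tracker would search only a small window around the previous frame's estimate and, as in the paper, re-project a textured 3D head model at candidate poses rather than sliding a flat patch.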
Similar articles
Face translation: A multimodal translation agent
In this paper, we present Face Translation, a translation agent for people who speak different languages. The system can not only translate a spoken utterance into another language, but also produce an audio-visual output with the speaker’s face and synchronized lip movement. The visual output is synthesized from real images based on image morphing technology. Both mouth and eye movements are g...
Chapter 16: Joint Audio-Video Processing for Robust Biometric Speaker Identification in Car
In this chapter, we present our recent results on a multilevel Bayesian decision fusion scheme for the multimodal audio-visual speaker identification problem. The objective is to improve recognition performance over conventional decision fusion schemes. The proposed system decomposes the information in a video stream into three components: speech, lip trace and face texture. Lip trac...
AV@CAR: A Spanish Multichannel Multimodal Corpus for In-Vehicle Automatic Audio-Visual Speech Recognition
This paper describes the acquisition of the multichannel multimodal database AV@CAR for automatic audio-visual speech recognition in cars. Automatic speech recognition (ASR) plays an important role inside vehicles in keeping the driver away from distraction. It is also known that visual information (lip-reading) can improve ASR accuracy under adverse conditions such as those within a car. The corpus...
Performance Enhancement in Lip Synchronization Using MFCC Parameters
Many multimedia applications and entertainment industry products, such as games, cartoons and film dubbing, require speech-driven face animation and audio-video synchronization. An automatic speech recognition (ASR) system alone does not give good results in noisy environments. An audio-visual speech recognition system plays a vital role in such harsh environments, as it uses both audio and visual informati...
Journal: EURASIP J. Adv. Sig. Proc.
Volume: 2004
Issue: -
Pages: -
Publication date: 2004